Length analysis of speech to be recorded in the recognition of Parkinson's disease
Parkinson's disease is, to present clinical knowledge, an incurable neurodegenerative disease. It is diagnosed mostly by exclusion tests. Numerous studies have confirmed that speech can be a promising indicator of the presence of the disease. On the other hand, only a few studies discuss the appropriate length of the speech sample or the contribution of parts of the full-length recordings to the classification. Hence, we partitioned each original recording into four shorter samples. We trained linear and radial basis function (RBF) kernel Support Vector Machine (SVM) models separately on the original recordings, on each partitioned group, and on all partitioned samples together. We found no significant difference between the results of the RBF kernel models. However, we obtained significantly better results with a portion of the entire speech using linear kernel models. In conclusion, even a shorter piece of a longer speech sample may be adequate for classification.
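The partitioning step described above can be sketched simply; this is a minimal illustration under the assumption of equal-length contiguous segments (the abstract does not specify how boundaries or leftover frames are handled):

```python
def partition(samples, parts=4):
    """Split one recording's frame sequence into `parts` contiguous,
    nearly equal-length segments; leftover frames go to the last one."""
    n = len(samples) // parts
    segs = [samples[i * n:(i + 1) * n] for i in range(parts - 1)]
    segs.append(samples[(parts - 1) * n:])
    return segs
```

Each segment (or all of them pooled) can then be fed to a separate SVM training run, as in the study.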
Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings
In forensic voice comparison, speaker embeddings have become widely popular
in the last 10 years. Most pretrained speaker embeddings are trained on
English corpora, because such corpora are easily accessible. Thus, language
dependency can be an important factor in automatic forensic voice comparison,
especially when the target language is linguistically very different. Numerous
commercial systems are available, but their models are mainly trained on a
language (mostly English) other than the target language. In the case of a
low-resource language, developing a corpus for forensic purposes that contains
enough speakers to train deep learning models is costly. This study
investigates whether a model pre-trained on an English corpus can be used on a
target low-resource language (here, Hungarian) different from the one the
model was trained on. Moreover, multiple samples are often not available from
the offender (unknown speaker). Therefore, samples are compared pairwise, with
and without speaker enrollment for the suspect (known) speakers. Two corpora
developed especially for forensic purposes are used, along with a third meant
for traditional speaker verification. Two deep learning based speaker
embedding extraction methods are applied: the x-vector and ECAPA-TDNN. Speaker
verification was evaluated in the likelihood-ratio framework, and a comparison
was made between the language combinations (modeling, LR calibration,
evaluation). The results were evaluated with the minCllr and EER metrics. It
was found that a model pre-trained on a different language, but on a corpus
with a large number of speakers, performs well on samples with language
mismatch. The effects of sample duration and speaking style were also
examined: the longer the duration of the sample in question, the better the
performance, and there is no substantial difference when various speaking
styles are applied.
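The EER metric used in this evaluation can be computed directly from raw same-speaker and different-speaker comparison scores; the following is a minimal threshold-sweeping sketch, not the authors' evaluation code:

```python
def eer(same_scores, diff_scores):
    """Equal error rate: find the threshold where the false-accept rate
    (different-speaker scores above threshold) equals the false-reject
    rate (same-speaker scores below threshold), by sweeping candidates."""
    best_gap, best_eer = 2.0, None
    for t in sorted(same_scores + diff_scores):
        frr = sum(s < t for s in same_scores) / len(same_scores)
        far = sum(s >= t for s in diff_scores) / len(diff_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

A perfectly separated system yields an EER of 0; overlapping score distributions push it toward 0.5.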
Effects of emotional speech on forensic voice comparison using deep speaker embeddings
Emotional conditions play a significant role in forensic voice comparison and speaker verification systems. When emotion is present in speech, verification performance deteriorates. In this paper, speaker verification on emotional speech is investigated and analyzed with metrics used to evaluate forensic voice comparison, using pre-trained speaker embedding models (x-vector and ECAPA-TDNN) for embedded feature extraction. This study investigates whether emotional content affects forensic voice comparison and verification performance, evaluated on a Hungarian speech dataset. Speaker verification performance was assessed in the likelihood-ratio framework using Cllr, Cllrmin, and Equal Error Rate. ECAPA-TDNN achieved higher performance than the x-vector. In the same-emotion scenario, the best EERs were 2.6% and 7.7% for ECAPA-TDNN and the x-vector, respectively. Both models are sensitive to the emotional content of the speech samples.
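The Cllr metric reported here is the standard log-likelihood-ratio cost used in the likelihood-ratio framework; a minimal sketch of its textbook formula (not the authors' evaluation code):

```python
import math

def cllr(lr_same, lr_diff):
    """Log-likelihood-ratio cost: averages log2(1 + 1/LR) over
    same-speaker trials and log2(1 + LR) over different-speaker trials.
    0 is perfect; 1.0 matches an uninformative system (LR = 1 always)."""
    c_same = sum(math.log2(1 + 1 / lr) for lr in lr_same) / len(lr_same)
    c_diff = sum(math.log2(1 + lr) for lr in lr_diff) / len(lr_diff)
    return 0.5 * (c_same + c_diff)
```

Cllrmin is the same cost after an optimal monotonic recalibration of the scores, isolating discrimination loss from calibration loss.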
Deep learning methods in speaker recognition: a review
This paper summarizes the applied deep learning practices in the field of
speaker recognition, covering both verification and identification. Speaker
recognition has been a widely studied topic of speech technology. Many
research works have been carried out, yet only modest progress was achieved in
the past 5-6 years. However, as deep learning techniques advance in most
machine learning fields, the former state-of-the-art methods are being
replaced by them in speaker recognition as well. DL appears to have become the
current state-of-the-art solution for both speaker verification and
identification. The standard x-vectors, in addition to i-vectors, are used as
baselines in most novel works. The increasing amount of gathered data opens up
the territory to DL, where it is the most effective.
Cross-lingual dysphonic speech detection using pretrained speaker embeddings
In this study, cross-lingual binary classification and severity estimation of dysphonic speech were carried out. Hand-crafted acoustic feature extraction is replaced by the speaker embedding techniques used in speaker verification. Two state-of-the-art deep learning methods for speaker verification were used: the x-vector and ECAPA-TDNN. Embeddings were extracted from speech samples in Hungarian and Dutch and used to train a Support Vector Machine (SVM) and Support Vector Regressor (SVR) for binary classification and severity estimation, in a cross-language manner. Our results were competitive with manual feature engineering when the models were trained on Hungarian samples and evaluated on Dutch samples in the binary classification of dysphonic speech, and outperformed it in estimating the severity level of dysphonic speech. Moreover, our model achieved 0.769 and 0.771 Spearman and Pearson correlations, respectively. Our results in both classification and regression were also superior to the manual feature extraction technique when models were trained on Dutch samples and evaluated on Hungarian samples, where only a limited number of samples was available for training. An accuracy of 86.8% was reached with features extracted by the embedding methods, while the maximum accuracy using hand-crafted acoustic features was 66.8%. Overall, the results show that the Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) performs better than the earlier x-vector in both tasks.
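The Spearman correlation reported for severity estimation is simply the Pearson correlation computed over rank-transformed values; a small self-contained sketch (with average ranks for ties), not the evaluation script used in the study:

```python
def _rank(xs):
    """Rank values from 1; tied values receive their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson over the rank-transformed data."""
    return pearson(_rank(x), _rank(y))
```

Spearman is preferred for severity scales because it only assumes a monotonic, not linear, relationship between predicted and clinician-assigned scores.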
Forensic authorship classification by paragraph vectors of speech transcriptions
In forensic comparison, document classification techniques are used mainly for authorship classification and author profiling. In the present study, we aim to introduce paragraph vector modelling (by Doc2Vec) into the likelihood-ratio framework of forensic evidence comparison. Transcriptions of spontaneous speech recordings are used as input for training the paragraph vector extraction model. Logistic regression models are trained on cosine distances of paragraph vector pairs to predict the probability of same- and different-author origin. Results are evaluated across different speaking styles (transcriptions of the speech tasks available in the dataset). Cllr and equal error rate values (the lowest being 0.47 and 0.11, respectively) show that the method can be useful as a feature for forensic authorship comparison and may complement voice-comparison methods for speaker verification.
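The core scoring step above, cosine distance between a pair of paragraph vectors fed through a logistic regression, can be sketched as follows; the logistic weights are hypothetical placeholders, not the paper's fitted values:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two paragraph vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def same_author_probability(dist, w=-10.0, b=5.0):
    """Logistic regression on the distance: small distances map to high
    same-author probability. w and b are illustrative, not fitted."""
    return 1.0 / (1.0 + math.exp(-(w * dist + b)))
```

In the actual pipeline the weights would be fitted on labelled same/different-author transcription pairs, and the resulting probabilities converted to likelihood ratios for the forensic framework.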
Detecting the customer's emotional state in telephone customer-service dialogues
In this paper we report an emotion-recognition experiment in which we aim to automatically detect, over the telephone, an emotional state that has shifted from neutral to irritated or tense during spontaneous conversation. The goal is to develop an automatic monitoring system that determines the degree of the customer's satisfaction or dissatisfaction. For this work we created the Hungarian Telephone Customer Service Speech Database (MTÜBA) from 1000 recorded phone calls, in which we annotated the linguistic content of the spontaneous dialogues and the emotional content of each phrase. After acoustic preprocessing, emotion recognition was performed with a support vector machine (SVM) classifier. In the end, the SVM classifier distinguished only two states: a neutral one and one expressing dissatisfaction (irritated and complaining combined). For the automatic monitoring system we selected a 15-second observation window, within which we counted the number of phrases indicating dissatisfaction; this count gave the degree of dissatisfaction. The window was advanced in 10-second steps over the course of the conversation. Through experimentation, a dissatisfaction threshold could be set above which an alert is raised. With this threshold set to a 30% dissatisfaction level, the average alerting accuracy was 89.6%, with errors mostly stemming only from the time lag between manual and automatic alerts. The monitoring system developed here may thus be a useful tool in dispatcher centres.
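The windowing scheme described above (a 15-second window stepped by 10 seconds, alerting when at least 30% of phrases in the window are labelled dissatisfied) can be sketched as follows; the phrase representation is a hypothetical simplification of the annotated data:

```python
def alert_times(phrases, win=15.0, step=10.0, threshold=0.30):
    """phrases: list of (start_time_sec, is_dissatisfied) pairs.
    Slide a `win`-second window in `step`-second steps and return the
    window start times whose dissatisfied-phrase ratio reaches the
    alert threshold (30% in the experiment described above)."""
    if not phrases:
        return []
    end = max(t for t, _ in phrases)
    alerts, t0 = [], 0.0
    while t0 <= end:
        flags = [flag for t, flag in phrases if t0 <= t < t0 + win]
        if flags and sum(flags) / len(flags) >= threshold:
            alerts.append(t0)
        t0 += step
    return alerts
```

A production system would additionally debounce consecutive alerts and compare alert times against the manual annotation, which is where the reported time-lag errors arise.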
FORvoice 120+: Statistical analyses and automatic speaker verification experiments across time-separated recordings and different speech tasks
In this study we present acoustic-phonetic analyses and automatic speaker identification experiments on the FORvoice 120+ database, using the recordings of the 60 speakers completed so far. We carried out statistical analyses of speaker-dependent acoustic features and automatic speaker verification tests to examine differences across recording sessions and speech tasks. The statistical analyses covered acoustic-phonetic features related to fundamental frequency, formants, and speech tempo. The results showed that recordings made at different times barely affected the statistical values of the features, whereas considerable differences were observed across speech tasks. We also ran automatic speaker identification (verification) experiments with i-vector and x-vector implementations. Based on the tests, the longer the speech segments used, the more accurate the recognition result.
FORvoice 120+: a Hungarian-language longitudinal database for forensic voice comparison
This study presents, for the first time, FORvoice 120+, a Hungarian-language longitudinal database for forensic purposes. The aim of FORvoice is to create a forensically reliable, longitudinal, representative speaker database in Hungarian. The database provides material for forensic-phonetic research in Hungarian and for the development and evaluation of forensic voice-comparison systems. It will contain recordings of 120 speakers (60 female and 60 male). The recordings follow a strict protocol in line with international practice. FORvoice makes it possible to carry out acoustic, phonetic, linguistic, and speech-technology research, with particular regard to speakers' individual speech characteristics, and supports the development and evaluation of forensic voice-comparison systems as well as the identification of new, individual acoustic-phonetic features.